Add separate PVC for tiered storage cache #239
Conversation
Force-pushed from db077d2 to 7c22df7
Pre-Flight Review
| Concern | Status |
|---|---|
| Crash safety / re-entrancy | Correct — state written before replacement PVC created |
| Terminating pod = absent | Correct |
| Grow path (no special handling) | Correct |
| CC + log.dirs exclusion | Correct |
| CRD backward compatibility | omitempty on all new fields |
Findings
HIGH — Fix before merging
C1 — deleteRemovedCachePVCs deletes a mounted PVC
pkg/resources/kafka/kafka.go
When a TieredStorageCache: true storage config is removed while the broker pod is still running and no CacheVolumeStates entry exists for that mount path, handleBrokerCacheResizeCleanup returns early (len(cacheVolumeStates) == 0), then deleteRemovedCachePVCs runs unconditionally and issues a Delete against the actively-mounted PVC. The PVC-protection finalizer prevents immediate data loss, but the delete is issued unintentionally.
cleanupOrphanedDuplicateCachePVCs (correctly) gates on !brokerPodExists. deleteRemovedCachePVCs must do the same:
```go
if !brokerPodExists {
	if err := r.deleteRemovedCachePVCs(ctx, log, brokerId, pvcList, desiredMountPaths); err != nil {
		return err
	}
}
```

MEDIUM — Strongly recommended
C4 — Missing test: crash re-entry with no replacement PVC
pkg/resources/kafka/kafka_test.go
No test covers: initialCacheVolumeState = CacheResizePendingDeletion, only the old PVC exists (reconciler crashed between writing state and creating replacement), pod running. Expected: no Delete, no state clear, one Create (replacement re-staged). The invariant is described in comments but untested.
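A hedged sketch of what the missing case could look like; the case struct, its fields, and the PVC name below are placeholders, not the actual table that TestReconcileKafkaPvcTieredCacheResize defines.

```go
// Hypothetical shape of the missing table-driven case; every identifier is
// a stand-in for whatever the existing test table uses.
type cacheResizeTestCase struct {
	name                    string
	initialCacheVolumeState string
	existingPVCs            []string
	podRunning              bool
	expectedDeletes         int
	expectedCreates         int
	expectStateCleared      bool
}

var crashReentryNoReplacement = cacheResizeTestCase{
	name:                    "pending-deletion state, pod running, replacement never created",
	initialCacheVolumeState: "CacheResizePendingDeletion",
	existingPVCs:            []string{"kafka-0-cache-old"}, // only the old PVC survived the crash
	podRunning:              true,
	expectedDeletes:         0,     // the mounted old PVC must not be deleted
	expectedCreates:         1,     // the replacement is re-staged
	expectStateCleared:      false, // the resize has not completed yet
}
```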
A1 — Empty-string sentinel for map deletion is an implicit contract
pkg/k8sutil/status.go:182
CacheResizeState("") silently deletes map entries in generateBrokerState. The established pattern is DeleteVolumeStatus. An uninitialized CacheResizeState variable passed by a future caller would delete entries with no compile error. Consider a named constant (CacheResizeStateNone) or a dedicated DeleteCacheVolumeStatus helper.
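A minimal sketch of both suggestions, assuming the CacheResizeState type and the status map live roughly as described above; the exact string values and map shape are assumptions.

```go
// CacheResizeState mirrors the type referenced in the review; the concrete
// definition lives in the operator's API/status packages.
type CacheResizeState string

const (
	// CacheResizeStateNone gives the zero value a name, so an uninitialized
	// variable can no longer silently delete map entries.
	CacheResizeStateNone       CacheResizeState = ""
	CacheResizePendingDeletion CacheResizeState = "CacheResizePendingDeletion"
)

// DeleteCacheVolumeStatus follows the existing DeleteVolumeStatus pattern:
// deletion is an explicit operation rather than an empty-string sentinel.
func DeleteCacheVolumeStatus(states map[string]CacheResizeState, mountPath string) {
	delete(states, mountPath)
}
```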
A2 — "tieredStorageCache" annotation key duplicated as bare string 9+ times
pkg/resources/kafka/kafka.go, pvc.go, kafka_test.go
A typo at any read site silently treats a cache PVC as a regular data volume. Define a named constant (e.g., TieredStorageCacheAnnotation = "tieredStorageCache") following the BrokerIdLabelKey precedent.
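For reference, a one-liner along the lines the finding suggests; a later commit in this PR names the constant TieredStorageCacheAnnotationKey and places it in api/v1beta1.

```go
// In api/v1beta1, next to the existing label/annotation key constants:
const TieredStorageCacheAnnotationKey = "tieredStorageCache"

// Read sites then become pvc.Annotations[v1beta1.TieredStorageCacheAnnotationKey],
// which the compiler checks.
```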
A5 — Tiered-cache PVC guard predicate duplicated at 3 sites
pkg/resources/kafka/kafka.go:1298, 1326, 1339
pvc.DeletionTimestamp != nil || pvc.Annotations["tieredStorageCache"] != annotationTrue appears three times. Extract to a helper (isTieredCachePVC).
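A sketch of the extracted helper, written in the positive form so the three call sites skip on its negation; the constant names follow A2 and the existing annotationTrue, not necessarily what the package defines today.

```go
package kafka

import corev1 "k8s.io/api/core/v1"

const (
	tieredStorageCacheAnnotationKey = "tieredStorageCache" // see A2
	annotationTrue                  = "true"               // existing constant in the package
)

// isTieredCachePVC folds the terminating check into the positive form:
// a PVC counts only if it is not being deleted and carries the cache
// annotation. Call sites that skipped on the inline predicate now skip
// on !isTieredCachePVC(pvc).
func isTieredCachePVC(pvc *corev1.PersistentVolumeClaim) bool {
	return pvc.DeletionTimestamp == nil &&
		pvc.Annotations[tieredStorageCacheAnnotationKey] == annotationTrue
}
```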
SE1 — PVC deletion gated only on mutable annotations, no OwnerReference check
pkg/resources/kafka/kafka.go — deleteRemovedCachePVCs, cleanupOrphanedDuplicateCachePVCs
The List selector (LabelsForKafka + BrokerIdLabelKey) and annotation filter (tieredStorageCache, mountPath) have no OwnerReference verification. A principal with patch access to PVCs in the namespace could craft a matching PVC to trigger deletion. Add an ownership guard before deleting in both helpers.
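One possible shape for that guard, assuming the operator stamps its PVCs with an OwnerReference to the KafkaCluster; the Kind string and UID comparison below are assumptions about how ownership is recorded.

```go
package kafka

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownedByCluster reports whether the PVC carries an OwnerReference back to
// the KafkaCluster this reconciler is acting for. Both delete helpers would
// call this before issuing a Delete, so a PVC crafted with matching labels
// and annotations alone is never eligible for deletion.
func ownedByCluster(pvc *corev1.PersistentVolumeClaim, clusterUID types.UID) bool {
	for _, ref := range pvc.OwnerReferences {
		if ref.Kind == "KafkaCluster" && ref.UID == clusterUID {
			return true
		}
	}
	return false
}
```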
SE2 — Annotation backfill enables two-step privilege escalation
pkg/resources/kafka/kafka.go:1608
A KafkaCluster CR writer (without direct PVC delete access) can: (1) set TieredStorageCache: true on an existing data volume's storage config — operator backfills the annotation; (2) remove the storage config entry — deleteRemovedCachePVCs deletes the now-annotated data PVC, bypassing the CC disk-removal workflow. Consider restricting backfill to ClaimPending PVCs or adding a validation webhook.
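A sketch of the per-mountPath flip rejection a validating webhook could perform; storageConfig here stands in for the real v1beta1 storage config type, and the field names are assumptions.

```go
package webhooks

import "fmt"

// storageConfig is a placeholder for the real v1beta1 storage config type.
type storageConfig struct {
	MountPath          string
	TieredStorageCache bool
}

// rejectTieredCacheFlips matches old and new entries by mountPath and
// refuses any change to the TieredStorageCache flag on an entry that
// already existed, closing the two-step escalation described above.
func rejectTieredCacheFlips(oldCfgs, newCfgs []storageConfig) error {
	oldByPath := make(map[string]bool, len(oldCfgs))
	for _, sc := range oldCfgs {
		oldByPath[sc.MountPath] = sc.TieredStorageCache
	}
	for _, sc := range newCfgs {
		if prev, existed := oldByPath[sc.MountPath]; existed && prev != sc.TieredStorageCache {
			return fmt.Errorf("tieredStorageCache may not be changed on existing storage config %q", sc.MountPath)
		}
	}
	return nil
}
```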
D1 — Rollback mid-resize leaves two PVCs indefinitely with no recovery path
docs/tiered-storage-pvc-resize.md
If the operator is rolled back while pending-deletion state is set, the old operator ignores the unknown field — two PVCs coexist at the same mount path with no automatic recovery and no visible signal in kubectl describe kafkacluster. Add a "Rollback safety" section with detection steps and a manual recovery procedure.
D2 — No Kubernetes Events for resize lifecycle
pkg/resources/kafka/kafka.go
The entire resize lifecycle is visible only in operator logs. Emit record.Event(...) at minimum when pending-deletion is first written and when cleanup completes. This is the standard on-call triage interface; without it, a broker restart from a resize is indistinguishable from a crash without operator log access.
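A minimal sketch of those two events, assuming a controller-runtime record.EventRecorder is (or can be) plumbed into the reconciler; the reason strings, messages, and argument names are illustrative only.

```go
package kafka

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitResizeEvent covers the two minimum lifecycle points the review asks
// for: when pending-deletion is first written, and when cleanup completes.
func emitResizeEvent(recorder record.EventRecorder, cluster runtime.Object, brokerId, mountPath string, completed bool) {
	if !completed {
		recorder.Event(cluster, corev1.EventTypeNormal, "TieredCacheResizePending",
			fmt.Sprintf("broker %s: cache PVC at %s marked pending deletion for resize", brokerId, mountPath))
		return
	}
	recorder.Event(cluster, corev1.EventTypeNormal, "TieredCacheResizeCompleted",
		fmt.Sprintf("broker %s: old cache PVC at %s deleted and replacement bound", brokerId, mountPath))
}
```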
LOW — Optional improvements
| ID | Finding | Location |
|---|---|---|
| C3 | Add unit test for C1's scenario (config removed, pod running, no prior state) | kafka_test.go |
| A3 | annotationTrue used for both annotation values and Kafka config comparisons — split or inline in configmap.go | kafka.go:72 |
| A4 | skipBroker bool return from handleBrokerCacheResizeCleanup couples helper to caller's loop structure | kafka.go:360 |
| S3 | Two new //nolint:funlen lack rationale comments | kafka.go:1402, 1552 |
| D4 | UpdateBrokerStatus(ConfigOutOfSync) called every cycle during rolling restart; guard may miss due to stale in-memory state | kafka.go:~1653 |
| D6 | Pre-cleanup in e2e uses 10-min tsResizePhaseTimeout for a step that should complete in seconds | test_tiered_storage_cache_resize.go:228 |
Rejected Findings
| ID | Reason |
|---|---|
| C2 | getCreatedPvcForBroker deduplication target is irrelevant — reconcileDesiredPvcsForBroker independently identifies the correct PVC by size |
| A6 | Per-PVC r.List in inner loop is pre-existing; PR doesn't worsen it |
| SE3 | Log verbosity preference; mount paths/PVC names are standard operational metadata |
| D3 | SETUP_ENVTEST_VERSION pin has explicit rationale comment; intentionally excluded from renovate |
| D5 | Namespace set explicitly on the line before requireDeleteKafkaCluster; code comment documents the pattern |
A KafkaCluster CR writer could flip tieredStorageCache from false to true on an existing storageConfigs entry, causing the operator to silently exclude the volume from log.dirs and Cruise Control capacity, and subsequently route a shrink through the delete-and-recreate path that bypasses graceful disk drain — destroying live log data. Fix in three layers:

1. Validating webhook rejects flips on existing storageConfigs entries (matched per mountPath, scoped per BrokerConfigGroup and per Broker).
2. deleteRemovedCachePVCs is now gated on !brokerPodExists so a removal while the broker pod is still running cannot issue a Delete against a mounted PVC.
3. The tiered-cache classification trusts only the persisted PVC annotation, never the desired-spec annotation. The annotation backfill block is removed, since it converted the unprotected CR field into a persistent ground-truth tag on live data PVCs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related correctness issues in the tiered storage cache resize path:

(HIGH-3) When the client times out on the replacement-PVC Create but the server has already accepted the request, the next reconcile cycle's cache hasn't observed the new PVC yet. The size filter in reconcileDesiredPvcsForBroker then excludes the old PVC, alreadyCreated remains false, and a second Create with a fresh GenerateName produces a duplicate replacement. Add a strongly-consistent (DirectClient) read before falling through to Create, gated on CacheResizePendingDeletion state at this mountPath. If a replacement at the desired size already exists, reuse it.

(HIGH-2) cleanupOrphanedDuplicateCachePVCs existed specifically to mop up the duplicates produced by the staging race above. Its own implementation had a cache-lag bug: after handleBrokerCacheResizeCleanup deleted the old PVC and cleared the in-memory state in the same cycle, the cached PVC list still showed the deleted PVC as Bound (no DeletionTimestamp yet), so the function classified the legitimate replacement as a duplicate and deleted it. With the staging race closed, the cleanup function has no remaining purpose, and keeping a function whose implementation has a race that deletes legitimate replacements is worse than removing it. Delete the function and its call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
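A minimal sketch of the HIGH-3 idempotency check, assuming the reconciler can reach the uncached DirectClient and that cache PVCs carry the brokerId label plus the tieredStorageCache and mountPath annotations mentioned elsewhere in this review; the key strings below are assumptions.

```go
package kafka

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// replacementAlreadyExists performs the strongly-consistent read described
// above: called only while the mountPath is in CacheResizePendingDeletion,
// it looks for a live cache PVC already at the desired size before the
// caller falls through to Create.
func replacementAlreadyExists(ctx context.Context, direct client.Client,
	namespace, brokerId, mountPath string, desiredSize resource.Quantity) (bool, error) {
	var pvcs corev1.PersistentVolumeClaimList
	if err := direct.List(ctx, &pvcs,
		client.InNamespace(namespace),
		client.MatchingLabels{"brokerId": brokerId}); err != nil {
		return false, err
	}
	for i := range pvcs.Items {
		pvc := &pvcs.Items[i]
		if pvc.DeletionTimestamp != nil ||
			pvc.Annotations["tieredStorageCache"] != "true" ||
			pvc.Annotations["mountPath"] != mountPath {
			continue
		}
		// A live cache PVC already at the desired size is the replacement;
		// reuse it instead of creating a duplicate with a fresh GenerateName.
		if pvc.Spec.Resources.Requests.Storage().Cmp(desiredSize) == 0 {
			return true, nil
		}
	}
	return false, nil
}
```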
…tant The PVC annotation key "tieredStorageCache" was a bare string literal at five sites across pvc.go and kafka.go (write at PVC creation, three reads gating cache-specific behavior, one test fixture). A typo at any read site would silently misclassify a cache PVC as a regular log-dir PVC, sending it through Cruise Control rebalance — silent data loss surface. Define TieredStorageCacheAnnotationKey in api/v1beta1 alongside the existing label/annotation key constants and reference it everywhere. Compile-time-checked from now on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three test cases to TestReconcileKafkaPvcTieredCacheResize covering the recovery paths exercised by the HIGH-3 idempotency fix:

1. Pending-deletion state with running pod and no replacement — the reconciler crashed between writing state and creating the replacement. Asserts: replacement is re-staged, old PVC is not deleted, state is preserved (since the resize hasn't completed).
2. Pending-deletion with replacement present (symmetric) — the inner-match loop's size filter correctly identifies the replacement when both PVCs are visible to the cached client. No duplicate Create.
3. Pending-deletion with replacement visible only via DirectClient (asymmetric) — simulates the cache-lag window where Create succeeded server-side but the watch event hasn't propagated. The cached Client sees only the old PVC; the idempotency check uses DirectClient and detects the replacement. This case is the actual HIGH-3 regression guard: removing the DirectClient lookup makes it fail.

The test setup now wires a separate mockDirectClient so cached and uncached views can diverge per-test-case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add tiered storage cache as a separate PVC
Summary
Adds support for provisioning the tiered storage cache on its own PVC, separate from the Kafka log volumes. Cache volumes are excluded from log.dirs and from Cruise Control capacity. A shrink is handled by deleting and recreating the PVC: the replacement is created immediately (provisioning overlaps with gate evaluation), and the old PVC is deleted once the broker pod stops. A grow continues to use the standard Kubernetes in-place expansion path. Full details in docs/tiered-storage-pvc-resize.md.
Test plan